- Train-test split
- Cross-validation
- Bootstrap
10/1/2020
Recall the distinction between the test error and the training error:
\[Y = \beta_0 + \beta_1 x\]
\[Y = \beta_0 + \beta_1 x + \beta_2 x^2\]
\[Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\]
\[Y = ???\]
Ideal solution: have a large designated test set to evaluate performance. This is often not available!
So far we have evaluated performance with in-sample MSE (regression) and in-sample accuracy (classification).
Goal: estimate the prediction error of a fitted model.
A potential solution:
A perfectly sensible approach, but it still has some drawbacks!
A random split into two halves: the left part is the training set, the right part is the validation set.
Drawback 1: the estimate of the error rate can be highly variable.
Drawback 2: only some of the data points are used to fit the model.
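Below is a minimal sketch of this validation-set approach (in Python, with simulated data and a quadratic fit; all names and settings are illustrative, not from the slides). Repeating the random split several times also illustrates Drawback 1: the error estimate changes noticeably from split to split.

```python
# Validation-set approach: random half/half split, fit on one half,
# estimate the test error on the other half.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2, 2, n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.5, size=n)

def validation_mse(seed):
    rng_ = np.random.default_rng(seed)
    idx = rng_.permutation(n)
    train, valid = idx[: n // 2], idx[n // 2:]       # random half/half split
    coefs = np.polyfit(x[train], y[train], deg=2)    # fit on the training half only
    pred = np.polyval(coefs, x[valid])
    return np.mean((y[valid] - pred) ** 2)           # MSE on the held-out half

# Different random splits give different error estimates (Drawback 1).
print([round(validation_mse(s), 3) for s in range(5)])
```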
It is a widely used approach for estimating test error. Randomly divide the data into \(K\) equally sized parts (\(K\) non-overlapping groups, or folds).
For fold \(k\) in \(1:K\)
Estimate the error rate as:
\[CV_{(K)} = \frac{1}{K} \sum_{k=1}^K \text{Err}_k\]
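A minimal sketch of \(K\)-fold cross-validation for the same illustrative polynomial setting (the data, loss, and \(K\) are assumed choices, not part of the slides):

```python
# K-fold CV: split the indices into K non-overlapping folds, hold out
# one fold at a time, and average the K held-out error estimates.
import numpy as np

def kfold_cv_mse(x, y, degree, K=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)   # K roughly equal folds
    errs = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        coefs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
        pred = np.polyval(coefs, x[test_idx])
        errs.append(np.mean((y[test_idx] - pred) ** 2))  # Err_k
    return np.mean(errs)                                 # CV_(K)
```

Setting \(K = n\) in this loop gives LOOCV, discussed next.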
Key idea: average over all possible testing sets of size \(n_{test} = 1\).
For observation \(i\) in \(1:n\)
Estimate the error rate as:
\[CV_{(n)} = \frac{1}{n} \sum_{i=1}^n \text{Err}(y_i, \hat{y}_i)\]
Less bias than the train/test split approach:
No randomness:
Downside: must re-fit \(n\) times!
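A brute-force LOOCV sketch for the same illustrative quadratic setup, making the re-fitting cost explicit: the model is refit \(n\) times, once per held-out observation.

```python
# LOOCV by brute force: n separate refits, one per left-out observation.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2, 2, n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.5, size=n)

errs = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i                        # leave observation i out
    coefs = np.polyfit(x[keep], y[keep], deg=2)     # refit on the other n-1 points
    errs[i] = (y[i] - np.polyval(coefs, x[i])) ** 2
cv_n = errs.mean()                                  # CV_(n)
print(round(cv_n, 4))
```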
With least-squares linear or polynomial regression, an amazing shortcut makes the cost of LOOCV the same as that of a single model fit! The following formula holds
\[CV_{(n)} = \dfrac{1}{n} \sum_{i=1}^n \left( \dfrac{y_i - \hat{y}_i}{1 - h_i} \right)^2\]
where \(\hat{y}_i\) is the \(i^{th}\) fitted value from the original least squares fit, and \(h_i\) is the leverage.
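A sketch of the leverage shortcut for least-squares regression, checked against the explicit \(n\)-refit loop (the simulated linear data are an illustrative assumption):

```python
# LOOCV shortcut: CV_(n) from a single least-squares fit via the leverages,
# compared with the explicit leave-one-out refitting loop.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(-2, 2, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])               # design matrix with intercept
beta = np.linalg.lstsq(X, y, rcond=None)[0]        # one least-squares fit
fitted = X @ beta
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))     # leverages h_i

cv_shortcut = np.mean(((y - fitted) / (1 - h)) ** 2)

cv_explicit = np.mean([
    (y[i] - X[i] @ np.linalg.lstsq(X[np.arange(n) != i],
                                   y[np.arange(n) != i], rcond=None)[0]) ** 2
    for i in range(n)
])
print(cv_shortcut, cv_explicit)                    # identical up to rounding
```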
Key insight: there is a bias-variance tradeoff in estimating test error.
Variance comes from overlap in the training sets:
Typical values: \(K = 5\) to \(K = 10\) (no theory; a purely empirical observation).
We divide the data into \(K\) roughly equal-sized parts \(C_1, \dots, C_K\), where \(C_k\) denotes the indices of the observations in fold \(k\). There are \(n_k\) observations in fold \(k\): if \(n\) is a multiple of \(K\), then \(n_k = n/K\).
Compute \[CV_{(K)} = \frac{1}{K} \sum_{k=1}^K \text{Err}_k\] where \(\text{Err}_k = \sum_{i \in C_k} \mathcal{I}(y_i \neq \hat{y}_i)/n_k\).
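The same loop with a misclassification-rate loss, sketched for an assumed simple nearest-centroid classifier on simulated binary data (both are illustrative choices, not from the slides):

```python
# K-fold CV for classification: Err_k is the misclassification rate on fold C_k.
import numpy as np

def nearest_centroid_predict(X_tr, y_tr, X_te):
    # Assign each test point to the class with the nearer training mean.
    mu0, mu1 = X_tr[y_tr == 0].mean(0), X_tr[y_tr == 1].mean(0)
    return (np.linalg.norm(X_te - mu1, axis=1) <
            np.linalg.norm(X_te - mu0, axis=1)).astype(int)

rng = np.random.default_rng(2)
n = 300
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 2)) + 1.5 * y[:, None]     # shifted class means

K = 5
folds = np.array_split(rng.permutation(n), K)      # the index sets C_1, ..., C_K
errs = []
for k in range(K):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    pred = nearest_centroid_predict(X[train], y[train], X[test])
    errs.append(np.mean(pred != y[test]))          # Err_k
print(np.mean(errs))                               # CV_(K)
```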
Consider a simple binary classifier:
How do we estimate the test set performance of this classifier? Can we apply cross-validation in step 2, forgetting about step 1?
This would ignore the fact that in Step 1, the procedure has already seen the labels of the training data, and made use of them. This is a form of training and must be included in the validation process.
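A sketch of why step 1 must live inside the CV loop. The slides do not spell out the two steps, so this example assumes the classic version: step 1 screens the features most correlated with the labels, step 2 fits a classifier on the screened features. On pure-noise data the true error rate is 50%; screening outside the CV loop reports a misleadingly low error, screening inside each fold does not.

```python
# Wrong vs. right placement of a label-using screening step relative to CV.
import numpy as np

rng = np.random.default_rng(3)
n, p, keep = 50, 1000, 10
X = rng.normal(size=(n, p))              # pure noise: labels are unpredictable
y = rng.integers(0, 2, n)

def screen(X, y, keep):
    # Indices of the `keep` features most correlated (in absolute value) with y.
    corr = np.abs(np.corrcoef(X, y, rowvar=False)[-1, :-1])
    return np.argsort(corr)[-keep:]

def centroid_error(X_tr, y_tr, X_te, y_te):
    mu0, mu1 = X_tr[y_tr == 0].mean(0), X_tr[y_tr == 1].mean(0)
    pred = (np.linalg.norm(X_te - mu1, axis=1) <
            np.linalg.norm(X_te - mu0, axis=1)).astype(int)
    return np.mean(pred != y_te)

folds = np.array_split(rng.permutation(n), 5)
wrong, right = [], []
sel_all = screen(X, y, keep)             # WRONG: screening has seen every label
for k in range(5):
    te = folds[k]
    tr = np.concatenate([folds[j] for j in range(5) if j != k])
    wrong.append(centroid_error(X[np.ix_(tr, sel_all)], y[tr],
                                X[np.ix_(te, sel_all)], y[te]))
    sel_k = screen(X[tr], y[tr], keep)   # RIGHT: screen using training folds only
    right.append(centroid_error(X[np.ix_(tr, sel_k)], y[tr],
                                X[np.ix_(te, sel_k)], y[te]))
# The first estimate is typically well below 50%, the second close to 50%.
print(np.mean(wrong), np.mean(right))
```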
Train on the training folds, validate on held-out fold 1 to get \(\text{Err}_{CV,1}\).
Train on the training folds, validate on held-out fold 2 to get \(\text{Err}_{CV,2}\), and so on for all \(K\) folds.
Average the CV errors \(\text{Err}_{CV,1}, \dots, \text{Err}_{CV,K}\):
\[\widehat{\text{Err}}_{CV} = \frac{1}{K} \sum_{k=1}^{K} \text{Err}_{CV,k}\]
We do this for several models, and choose the model with the smallest \(\widehat{\text{Err}}_{CV}\).
Calculate the error on the test set, \(\widehat{\text{Err}}_{test}\). This is an unbiased estimate of the error!
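A sketch of this full workflow, using simulated data and candidate polynomial degrees as illustrative stand-ins for "several models": hold out a test set, pick the model by K-fold CV on the remaining data, then evaluate the chosen model once on the untouched test set.

```python
# Model selection by CV plus a held-out test set for the final error estimate.
import numpy as np

rng = np.random.default_rng(4)
n = 300
x = rng.uniform(-2, 2, n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.5, size=n)

# Step 0: set aside a test set that is never used during model selection.
idx = rng.permutation(n)
test, rest = idx[:100], idx[100:]

def kfold_cv_mse(x, y, degree, K=5, seed=0):
    rng_ = np.random.default_rng(seed)
    folds = np.array_split(rng_.permutation(len(y)), K)
    errs = []
    for k in range(K):
        te = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        coefs = np.polyfit(x[tr], y[tr], deg=degree)
        errs.append(np.mean((y[te] - np.polyval(coefs, x[te])) ** 2))
    return np.mean(errs)

# Step 1: choose the model with the smallest estimated CV error.
degrees = [1, 2, 3]
cv_errs = {d: kfold_cv_mse(x[rest], y[rest], d) for d in degrees}
best = min(cv_errs, key=cv_errs.get)

# Step 2: refit the chosen model on all non-test data and evaluate it
# once on the test set.
coefs = np.polyfit(x[rest], y[rest], deg=best)
test_err = np.mean((y[test] - np.polyval(coefs, x[test])) ** 2)
print(best, round(test_err, 3))
```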
Say that we have some data \(\mathbf{X} = (x_1, \dots, x_n)\) and we want to estimate the location parameter \(\mu\) via the classic estimator \(\overline{\mathbf{X}}\).
How can we quantify our uncertainty for this estimate?
Question: “how might my estimate \(\overline{\mathbf{X}}\) have been different if I had seen a different sample of values \((x_i)\) from the same population \(P(X)\)?”
The bootstrap consists of the following algorithm:
For \(b\) in \(1:B\)
This gives us \(B\) draws from the bootstrapped sampling distribution of \(\overline{\mathbf{X}}\). Use these draws to form (approximate) confidence intervals and standard errors for \(\mu\).
Key: approximate \(P(X)\) with \(\widehat{P}(X)\)!
We have a sample from a normal distribution, and want to quantify uncertainty around the mean parameter.
We then have a sample of estimates \((\overline{X}^{(1)}, \dots, \overline{X}^{(B)})\).
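A minimal sketch of the nonparametric bootstrap for this example (the normal parameters, \(n\), and \(B\) are illustrative assumptions): resample the data with replacement, i.e. draw from \(\widehat{P}(X)\) rather than \(P(X)\), recompute \(\overline{\mathbf{X}}\) each time, and summarize the resulting draws.

```python
# Bootstrap for the sample mean of a normal sample.
import numpy as np

rng = np.random.default_rng(5)
n, B = 100, 2000
x = rng.normal(loc=3.0, scale=2.0, size=n)    # observed sample

boot_means = np.empty(B)
for b in range(B):
    # Draw n observations from the data with replacement (sampling from P-hat)
    # and recompute the estimator on the resampled data.
    resample = rng.choice(x, size=n, replace=True)
    boot_means[b] = resample.mean()

# Bootstrap standard error and a simple percentile confidence interval for mu.
se = boot_means.std(ddof=1)
ci = np.percentile(boot_means, [2.5, 97.5])
print(round(x.mean(), 3), round(se, 3), ci)
```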